Using a Random Forest Classifier to Compile Bilingual Dictionaries of Technical Terms from Comparable Corpora
Abstract
We describe a machine learning approach, a Random Forest (RF) classifier, that automatically compiles bilingual dictionaries of technical terms from comparable corpora. We evaluate the RF classifier against a popular term alignment method, namely context vectors, and report an improvement in translation accuracy. As an application, we combine the automatically extracted dictionary with a trained Statistical Machine Translation (SMT) system to translate unknown terms more accurately. The dictionary extraction method described in this paper is freely available.
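For readers unfamiliar with the context-vector baseline mentioned in the abstract, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes tokenized sentences, a small seed dictionary, and raw co-occurrence counts compared by cosine similarity; the function names, the window size, and the toy English/Spanish data are all hypothetical.

```python
import math
from collections import Counter


def context_vector(term, sentences, window=2):
    """Count the words that occur within +/- `window` tokens of `term`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != term:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def translate_vector(vec, seed_dict):
    """Map source-language dimensions to the target language; drop unknown words."""
    out = Counter()
    for word, count in vec.items():
        if word in seed_dict:
            out[seed_dict[word]] += count
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(src_term, src_sents, tgt_sents, candidates, seed_dict):
    """Return (candidate, score) pairs, best-scoring translation candidate first."""
    src_vec = translate_vector(context_vector(src_term, src_sents), seed_dict)
    scored = [(c, cosine(src_vec, context_vector(c, tgt_sents))) for c in candidates]
    return sorted(scored, key=lambda pair: -pair[1])


# Toy comparable corpora (hypothetical data, not taken from the paper).
src_sents = [
    ["the", "patient", "received", "insulin", "for", "diabetes"],
    ["insulin", "lowers", "blood", "glucose"],
]
tgt_sents = [
    ["el", "paciente", "recibio", "insulina", "para", "diabetes"],
    ["insulina", "reduce", "glucosa", "en", "sangre"],
    ["el", "paciente", "come", "pan", "con", "queso"],
]
seed_dict = {
    "patient": "paciente",
    "received": "recibio",
    "diabetes": "diabetes",
    "blood": "sangre",
    "glucose": "glucosa",
}

ranked = rank_candidates("insulin", src_sents, tgt_sents, ["insulina", "pan"], seed_dict)
print(ranked)
```

On this toy data the candidate that shares more translated context words with the source term is ranked first. Real systems typically weight the counts (for example with an association measure) rather than using raw frequencies; that step is omitted here for brevity.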
Related articles
Combining String and Context Similarity for Bilingual Term Alignment from Comparable Corpora
Automatically compiling bilingual dictionaries of technical terms from comparable corpora is a challenging problem, yet with many potential applications. In this paper, we exploit two independent observations about term translations: (a) terms are often formed by corresponding sub-lexical units across languages and (b) a term and its translation tend to appear in similar lexical context. Based ...
Automatic Methods for the Extension of a Bilingual Dictionary using Comparable Corpora
Bilingual dictionaries define word equivalents from one language to another, thus acting as an important bridge between languages. No bilingual dictionary is complete since languages are in a constant state of change. Additionally, dictionaries are unlikely to achieve complete coverage of all language terms. This paper investigates methods for extending dictionaries using non-aligned corpora, b...
Extracting Bilingual Persian Italian Lexicon from Comparable Corpora Using Different Types of Seed Dictionaries
Ebrahim Ansari et al. 2017. Extracting bilingual Persian Italian lexicon from comparable corpora using different types of seed dictionaries. In "Applications of Comparable Corpora", edited book, Berlin Linguistic Press (ed.). Bilingual dictionaries are very important in various fields of natural language processing. In recent years, research on extracting new bilingual lex...
Use of the Japio Technical Field Dictionaries and Commercial Rule-based Engine for NTCIR-PatentMT
Japio provides various patent-related translation services and owns an original bilingual technical term database derived from patent documents (the Japio Terminology Database), which is used by its translators. Currently the database contains more than 1,900,000 J-E bilingual technical terms. The Japio Technical Field Dictionaries (technical-field-oriented machine translation dictionaries) are created f...
Automatic Generation of Bilingual Dictionaries Using Intermediary Languages and Comparable Corpora
This paper outlines a strategy to build new bilingual dictionaries from existing resources. The method is based on two main tasks: first, a new set of bilingual correspondences is generated from two available bilingual dictionaries. Second, the generated correspondences are validated using a bilingual lexicon automatically extracted from non-parallel, comparable corpora. The qual...